Sequence alignments unambiguously distinguish between
protein pairs of similar and non-similar structure when
the pairwise sequence identity is high (>40% for long
alignments). The signal gets blurred in the twilight zone of
20-35% sequence identity. Here, more than a million
sequence alignments were analysed between protein pairs
of known structures to re-define a line distinguishing
between true and false positives for low levels of similarity.
Four results stood out. (i) The transition from the safe zone
of sequence alignment into the twilight zone is described by
an explosion of false negatives. More than 95% of all pairs
detected in the twilight zone had different structures. More
precisely, above a cut-off roughly corresponding to 30%
sequence identity, 90% of the pairs were homologous;
below 25% less than 10% were. (ii) Whether or not
sequence homology implied structural identity depended
crucially on the alignment length. For example, if 10
residues were similar in an alignment of length 16 (>60%),
structural similarity could not be inferred. (iii) The 'more
similar than identical' rule (discarding all pairs for which
percentage similarity was lower than percentage identity)
reduced false positives significantly. (iv) Using intermediate
sequences for finding links between more distant families
was almost as successful: pairs were predicted to be
homologous when the respective sequence families had
proteins in common. All findings are applicable to
automatic database searches.

Keywords: alignment quality analysis/evolutionary conservation/ 
genome analysis/protein sequence alignment/sequence space
hopping 



Introduction 

Protein sequence alignments in twilight zone 
Protein sequences fold into unique three-dimensional (3D)
structures. However, proteins with similar sequences adopt
similar structures (Zuckerkandl and Pauling, 1965; Doolittle,
1981; Doolittle, 1986; Chothia and Lesk, 1986). Indeed, most
protein pairs with more than 30 out of 100 identical residues
were found to be structurally similar (Sander and Schneider,
1991). This high robustness of structures with respect to
residue exchanges explains partly the robustness of organisms
with respect to gene-replication errors, and it allows for
the variety in evolution (Zuckerkandl and Pauling, 1965;
Zuckerkandl, 1976; Doolittle, 1979, 1986). Structure
alignments have uncovered homologous protein pairs with less than
10% pairwise sequence identity (Valencia et al., 1991; Holmes
et al., 1993; Holm and Sander, 1996; Brenner et al., 1996;
Hubbard et al., 1997). Indeed, most similar protein structure
pairs appear to have less than 12% pairwise sequence identity
(Rost, 1997). Furthermore, the average sequence identity
between all pairs of similar structures is supposedly 8-10%,
and the observed distribution (Gaussian peaking around 8%
identity) marks another region, the midnight zone (Rost, 1991).
The midnight zone is populated by protein structure pairs that
may have become similar by convergent or divergent evolution
(Doolittle, 1994; Rost, 1997). Threading algorithms ultimately
aim at revealing homologous pairs from the midnight zone
(Wodak and Rooman, 1993; Bryant and Altschul, 1995; Sippl,
1995; Rost and Sander, 1996; Sippl and Floeckner, 1996;
Fischer et al., 1996; Rost and O'Donoghue, 1997).
Conventional sequence alignment methods become problematic at
much higher values of sequence identity. Methods often fail
to correctly align protein pairs with 20-30% pairwise sequence
identity. Hence, Doolittle (1986) coined the term twilight zone
for sequence alignments in this region. Do the difficulties
of alignment methods in this zone reflect merely technical
difficulties (statistical significance of detection), or is the
twilight zone defined by a particular feature of evolution?

Length-dependent cut-off for significant sequence identity 
Pairwise sequence identity (percentage of residues identical
between two proteins) is not sufficient to define the twilight
zone. Instead, analysing the relatively small number of structure
pairs available in 1990, Sander and Schneider (1991) defined
a length-dependent threshold for significant sequence identity.
The threshold curve defined (dubbed HSSP-curve) was roughly
proportional to the inverse square-root of the length for
alignments between 7 and 80 residues, and was clipped to
saturate at 25% sequence identity over more than 80 residues.
In 1990, no pair with more than 30 identical residues of 100
aligned had different structures (Sander and Schneider, 1991).
Was this still true for the five times larger PDB (Bernstein
et al., 1977) of 1997?

Hopping in sequence space 

If we could plot the space of protein sequences, would we
observe the protein families as islands? Unfortunately, we
cannot tell. Nevertheless, useful information has been extracted
from sequence (Casari et al., 1995) and structure (Maiorov
and Crippen, 1995) space. In everyday database searches,
protein families are widened by exploiting the transitivity of
homology (Pearson, 1996): (i) a query sequence U is aligned
to a database, say SWISS-PROT (Bairoch and Apweiler, 1997);
(ii) all sequences aligned at levels of significant similarity are
used as new seeds U_i, and for each U_i SWISS-PROT is
searched again; (iii) this procedure is repeated until no new
sequences are found. Sequence space hopping may be used in
combination with knowledge from structures to widen families
(Holm and Sander, 1997), or to increase the information
contained in multiple sequence alignments input to prediction
methods (Rost, 1996, 1997). Recently, the transitivity of protein
families has been exploited successfully to automatically
increase the yield in database searches [Ruben Abagyan
presented the 'mam-link recognition' method 1996 at the
CASP2 meeting (Abagyan and Batalov, 1997); Park et al.
(1997) presented the 'intermediate sequence search' method
and Neuwald et al. (1997) implemented the same concept
(Neuwald et al., 1997)]. Here, I confirmed the original findings
based on a different data set, and analysed in detail how the
gain depended on the number of intermediate sequences and
their similarity.

Here, I present results of aligning a set of 792 sequence-unique
(no pair in the set has more than 25% sequence identity)
proteins of known structure against PDB. The following
questions were investigated. Is the number of protein pairs of
non-similar structures proportional to the distance from the
HSSP-curve (eqn 1), or do false positives increase more rapidly
in the twilight zone? Is the curve defined by Sander and
Schneider (1991) still valid? Would using sequence similarity
rather than identity improve accuracy (as speculated by
Schneider and Sander)? Finally, can the accuracy be improved for
pair alignments by expert rules? Based on a 1000-fold larger
data set, the results verify, partially, earlier work (Sander and
Schneider, 1991). The novel aspects were (i) a definition of a
threshold for similarity (eqn 3), and a refinement of the
threshold for identity (eqn 2); (ii) an introduction of various expert
rules. Aspects largely complementing other analyses
(Abagyan and Batalov, 1997; Park et al., 1997; Brenner et al.,
1998) were: (i) a large-scale evaluation of exploiting intermediate
sequences (sequence-space-hopping); (ii) a detailed analysis
of true and false positives providing estimates for accuracy
and coverage of database searches; and (iii) a comparison with
BLAST, one of the most popular methods for rapid database
searches (Altschul et al., 1990; Altschul and Gish, 1996).

Methods 

Data set: 792 sequence-unique protein structures 
Protein databases are biased towards particular protein families. 
To reduce this bias, analyses are usually restricted to
representative data sets (Hobohm et al., 1992). Here, I chose the
maximal set of sequence-unique proteins of known structure
available in early 1997 (Holm and Sander, 1996).
'Sequence-unique' was defined as 'no pair in the set falls above the
HSSP-curve' (eqn 1; Sander and Schneider, 1991). As a
rule-of-thumb, no pair had more than 25% pairwise sequence
identity. Each of these proteins was aligned against the subset
of PDB contained in the early 1997 release of the FSSP
database of protein structure alignments (Holm and Sander,
1996). This subset amounted in total to about 5646 protein
chains. Obviously the second step (792 versus 5646)
re-introduced bias into the results. However, aligning the 792
sequence-unique proteins against themselves would not have
yielded any result for most of the twilight zone analysed here.
Thus, 792 versus 5646 was the best compromise between reducing
bias and monitoring the biased region. The resulting test set
was the largest possible set of proteins for which structural
information was available (and thus false and correct hits
could be automatically distinguished).

Generation of sequence alignments 

Protein pairs were aligned by two different program types:
(i) full dynamic programming as implemented in the
Smith-Waterman (Smith and Waterman, 1981) based method
MaxHom (Schneider, 1994) (McLachlan metric, with
minimum = -0.5, maximum = 1.00, and gap open = 3, gap
elongation = 0.3); and (ii) quick database searches as
implemented by the two versions of the BLAST series: BLASTP
(Altschul et al., 1990; Altschul and Gish, 1996), and
PSI-BLAST (Altschul et al., 1997). All 792 unique proteins
were aligned against all 5646 proteins from the PDB subset.
Alignments shorter than 10 residues were not considered, as
identical polypeptides of up to 10 residues are known to occur
in different structure states (Kabsch and Sander, 1984; Cohen
et al., 1993). Technical limitations (CPU time) required the
restriction of the dynamic-programming analysis to the best
2000 hits for each of the 792 unique proteins. (Note: this
restriction applied only to the final displayed alignment. Of
course, all possible combinations were explored initially by
the alignment algorithm.) The resulting final data set comprised
about 1.7 million pairwise alignments. For the comparison
between the dynamic programming and the BLAST methods,
the data set had to be reduced to all pairs that were aligned
by all methods compared (the problem was that neither
BLASTP nor PSI-BLAST could be forced to report absolutely
wrong, i.e. ALL, pairwise alignments).

Definition of sequence identity and sequence similarity 
(i) Pairwise sequence identity was defined by the percentage
of residues identical between two aligned sequences (e.g.
aspartic matching aspartic counts 1: D - D = 1; aspartic on
glutamic was a non-match: D - E = 0). (ii) Pairwise sequence
similarity was defined by the percentage of residues similar
between two sequences (e.g. D - D ≤ 1; and aspartic on
glutamic was now considered a match: D - E > 0). Similarity
scores depend on the particular metric used to capture
physico-chemical properties of amino acids (note: most amino acids
are not considered 100% similar to themselves by typical
metrics, as such metrics are based on log-odds, e.g. for the
McLachlan metric only F, W, Y and C yield 100%
self-similarity). Consequently, levels of similarity are not directly
comparable between different metrics. For comparability, I
used the McLachlan metric (Gribskov et al., 1987) also used
in the HSSP database (Schneider et al., 1997). In principle,
there are two ways to convert similarity into percentage values:
(i) by normalizing the similarity score by the maximal possible
score observed in a given metric (percentage residue similarity);
and (ii) by setting an arbitrary threshold of the similarity score
to distinguish similar from not similar and counting the percentage
of residues that are similar according to this threshold
(percentage of similar residues). Again, I followed the practice of the
HSSP database, compiling the percentage residue similarity
(normalized by maximal possible scores). When compiling
percentages, the number of identical residues was normalized
by the number of residues aligned; gaps were ignored.
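
The two percentages defined above can be illustrated with a minimal Python sketch. It assumes the two sequences are given as equal-length aligned strings with '-' marking gaps, and it reads 'percentage residue similarity' as the summed pair scores divided by the maximal score in the metric times the number of aligned positions; that reading, the function names and the toy metric values are assumptions for illustration, not the McLachlan metric or the MaxHom implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical toy metric (NOT the McLachlan values); keys are residue pairs.
TOY_METRIC: Dict[Tuple[str, str], float] = {
    ("D", "D"): 0.8, ("D", "E"): 0.5, ("E", "E"): 0.8, ("F", "F"): 1.0,
}


def aligned_pairs(seq_a: str, seq_b: str, gap: str = "-") -> List[Tuple[str, str]]:
    """Return residue pairs at positions where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Identical residues as a percentage of aligned residues (gaps ignored)."""
    pairs = aligned_pairs(seq_a, seq_b)
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def percent_similarity(seq_a: str, seq_b: str,
                       metric: Dict[Tuple[str, str], float]) -> float:
    """Summed pair scores, normalized by the maximal score in the metric."""
    pairs = aligned_pairs(seq_a, seq_b)
    if not pairs:
        return 0.0
    max_score = max(metric.values())
    # Look up either orientation of the pair; unknown pairs score 0 here.
    total = sum(metric.get((a, b), metric.get((b, a), 0.0)) for a, b in pairs)
    return 100.0 * total / (max_score * len(pairs))
```

With a metric in which most residues score below the maximum against themselves, percent_similarity can come out lower than percent_identity for the same alignment, which is the situation the 'more-similar-than-identical' rule relies on.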

Standard of truth for structural similarity 
Fig. 1. Sketch of sequence-space-hopping. The triangle defines three search
proteins (A, B and C) having mutually less than 25% sequence identity. The
circles define the three families (all sequences inside the circle, indicated by
arbitrary names aaa_species, have more than 25% sequence identity to the
respective search proteins A, B and C). Sequence-space-hopping implies
joining the circles representing the protein families (as shown for proteins A
and B in the striped circles) if they contain identical proteins that are
aligned in the same region (abjcvb in the example given).

Similarity between two protein structures is not uniquely
defined. Different structure alignment methods yield different
scores (Alexandrov et al., 1992; Holm et al., 1993; Luo et al.,
1993; Orengo, 1994; Crippen and Maiorov, 1995; Gerstein
and Levitt, 1996; Holm and Sander, 1996; Orengo and Taylor,
1996; Zu-Kang and Sippl, 1996). Such differences can be
substantial, as illustrated by differences between the
expert-based database of structural alignments SCOP (Murzin et al.,
1995; Brenner et al., 1996; Hubbard et al., 1997), and the
automatically generated databases CATH (Orengo et al., 1993,
1997) and FSSP (Holm and Sander, 1996). In general, FSSP
tends to find more pairs of similar structure than do CATH
and SCOP. However, this is only a trend. For many examples,
SCOP finds structural similarity and FSSP does not. Here, I
chose the FSSP database as 'standard of truth': any pair for
which FSSP listed a significant score [zDALI > 4 (Holm and
Sander, 1996)] of structural similarity was considered to be
structurally similar. In order to distinguish between true and
false positives this decision implied that all pairs not listed at
the given cut-off of the FSSP database were structurally not
similar. However, this brought up the problem of different
structure alignment methods. For example, SCOP may consider
a pair structurally similar, and FSSP may not. Thus, all pairs
that were listed in FSSP but with lower z-scores were
additionally excluded from the analysis. Even that still left pairs of
proteins with clear levels of sequence identity (more than
40%) which were not found listed in FSSP. Thus, I had
to refine this procedure by semi-automatically checking the
structural similarity for about 2000 protein pairs all of which
had levels of above 30% pairwise sequence identity [note: this
number was negligibly small, as only 1% of all pairs were
found above this value (Fig. 2B)]. The particular way in which
the standard-of-truth was constructed implied that estimates for
true positives might be slightly optimistic, estimates for false
negatives slightly pessimistic.
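
The three-way labelling described above (similar, dissimilar, excluded) can be sketched as follows. This is a minimal illustration assuming that a pair's FSSP entry is available as an optional z-score, with None meaning 'not listed in FSSP'; the function name and this data representation are hypothetical, and the semi-automatic re-check of pairs above 30% identity is not modelled.

```python
from typing import Optional

Z_CUTOFF = 4.0  # zDALI cut-off for significant structural similarity (from the text)


def structural_label(z_score: Optional[float]) -> Optional[bool]:
    """Label a protein pair against the FSSP-based standard of truth.

    Returns True  for pairs listed with zDALI > 4 (structurally similar),
            False for pairs not listed in FSSP at all (treated as dissimilar),
            None  for pairs listed with a lower z-score (excluded from analysis).
    """
    if z_score is None:
        return False
    if z_score > Z_CUTOFF:
        return True
    return None
```

Excluding the intermediate pairs rather than counting them as dissimilar avoids penalizing a sequence alignment method for hits that another structure comparison method might consider correct.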

Concept of true and false hits 

Fig. 2. Explosion of structurally dissimilar pairs in the twilight zone.
Numbers of true (pairs with similar structure) and of false positives (pairs
with no similar structure) plotted versus the distance to the HSSP-curve
(Sander and Schneider, 1991), i.e. the horizontal axes give the distance from
the threshold defined in eqn 1 (numbers refer to the parameter n in eqn 1).
The levels of pairwise sequence identity corresponding to the distance were
shown on top. (A) Number of pairs observed at any distance (logarithmic
scale). (B) Cumulative number of pairs observed (logarithmic scale). For
example, at a threshold corresponding to about 32% sequence identity for
long alignments, the numbers of true and false positives were equal (arrow
in A); at about 29% even the cumulative numbers of true and false positives
were equal (arrow in B). Note: numbers of true negatives and false
negatives result from the cumulative sums left of the threshold; percentages
of true and false positives given in Figure 5.

When Chothia and Lesk (1986) first analysed the relation
between sequence and structure similarity, they monitored the
details of structural differences, and found that the differences
are inversely proportional to the level of sequence identity.
The binary notion of 'similar structure' (true or false) used in
this analysis reflected a different focus: the goal was to estimate
the accuracy in correctly detecting rather than in correctly
aligning homologues. Did this imply that correct detection and
correct alignment were not correlated (as often the case for
threading: Bryant and Altschul, 1995; Lemer et al., 1995;
Sippl, 1995; Fischer et al., 1996)? Not necessarily, but the
fact is that two homologues can be detected although part of,
or even the entire, alignment is wrong. (However, this extremely
irritating point was not pursued further in this analysis.)
Fig. 3. Pairwise sequence identity versus alignment length. The original
HSSP-curve (Sander and Schneider, 1991) (dotted circles, eqn 1) appeared
to fit the true positives (homologues, A) better than the false positives (B).
In contrast, the new curve proposed here (filled diamonds, eqn 2) was more
conservative in excluding false positives. Note that due to the huge number
of pairs the plots for true (A) and false (B) positives appeared almost
equally densely populated (Figure 2 revealed the problem of such a scatter
plot).

The following cases were distinguished: (i) true positives,
alignments between proteins of similar structure that fall above
a given threshold (defined by the sequence alignment method);
(ii) false positives, alignments between proteins of dissimilar
structure that fall above a given threshold of the sequence
alignment; (iii) true negatives, alignments between proteins of
dissimilar structure that fall below a given threshold; and
(iv) false negatives, alignments between proteins of similar
structure that fall below a given threshold. Note that 'negatives'
and 'positives' represent two sides of the same coin: at
any threshold n extracted from the sequence alignment, the
following equations hold (for cumulative numbers):

false negatives + true positives = all pairs of similar structure

true negatives + false positives = all pairs of
dissimilar structure.
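
The four cases and the two bookkeeping identities above can be checked with a small Python sketch. It assumes each pair is reduced to a (distance from threshold curve, structurally similar?) tuple and that 'above the threshold' includes equality; both the representation and the toy numbers are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical toy data: (distance n from the curve, structurally similar?)
TOY_PAIRS: List[Tuple[float, bool]] = [
    (10.0, True), (3.0, False), (-2.0, True), (-5.0, False), (1.0, False),
]


def confusion_counts(pairs: Iterable[Tuple[float, bool]],
                     threshold: float) -> Dict[str, int]:
    """Count true/false positives/negatives at a given cut-off distance."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for distance, similar in pairs:
        above = distance >= threshold  # assumption: ties count as 'above'
        if above and similar:
            counts["TP"] += 1
        elif above:
            counts["FP"] += 1
        elif similar:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts
```

For any threshold, FN + TP equals the number of similar pairs and TN + FP the number of dissimilar pairs, so moving the cut-off only shifts pairs between the positive and negative side of each sum.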

Distance to HSSP threshold 

The HSSP-curve was originally defined by (Sander and
Schneider, 1991):

\[
p^I(n) = n +
\begin{cases}
290.15 \cdot L^{-0.562}, & \text{for } L < 80 \\
25, & \text{for } L \geq 80
\end{cases}
\qquad (1)
\]

where L gave the number of residues aligned between two
proteins; p^I the cut-off percentage of identical residues over
the L aligned residues; and n described the distance in
percentage points from the curve (n = 0 corresponds to the
original HSSP-curve; n = 5 to the official HSSP database
releases; curve plotted in Figure 3). Once Schneider and Sander
(1991) had discovered the basic functional dependence between
sequence identity and alignment length, they merely had to
fix two free parameters: the factor and the exponent. Both
were chosen to fit the data observed in 1991, in particular to
reach values of 25% around alignment length of 80, and
values of 100% around alignment length of 10. The principal
functional dependence described by eqn 1 also follows
from statistics, as was recently shown in an elegant work
(Alexandrov and Soloveyev, 1998). Let p_i (i = 1,..., 20) be the
probability that amino acid i occurs in a protein, and m_{ij} the
score for randomly aligning two amino acids i and j. The score
S of an entire alignment can then be approximated by:

\[ S = \langle m \rangle \cdot L \]

where \langle m \rangle is the expectation value of m_{ij}, and L the alignment
length. If the values of m_{ij} are independent, Gaussian distributed
variables, it follows (after some elementary operations) that
the relation between the standard deviation of the values of
m_{ij} (\sigma_m), and that of the resulting score distribution (\sigma_S) is:

\[ \sigma_m = L^{-0.5} \cdot \sigma_S \]

In their original article Alexandrov and Soloveyev work
out an appropriate re-scaling of the dynamic programming
alignment. However, this scheme cannot be applied after the
alignment has been completed (as the threshold functions used
in this work); rather, it has to be implemented into the
alignment method.
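
As a hedged illustration of eqn 1, the following Python sketch evaluates the original HSSP threshold and the distance n of an alignment from it. It assumes the piecewise form p^I(n) = n + 290.15 * L^(-0.562) for L < 80 and n + 25 for L >= 80 (the handling of exactly L = 80 is an assumption); the function names are hypothetical.

```python
def hssp_threshold(length: int, n: float = 0.0) -> float:
    """Cut-off percentage identity of the original HSSP-curve (eqn 1)."""
    if length <= 0:
        raise ValueError("alignment length must be positive")
    # Power-law branch for short alignments, constant 25% for long ones.
    base = 290.15 * length ** -0.562 if length < 80 else 25.0
    return n + base


def hssp_distance(identity: float, length: int) -> float:
    """Distance n (percentage points) of an alignment from the n = 0 curve."""
    return identity - hssp_threshold(length, 0.0)
```

For example, an alignment over 150 residues with 32% identity lies at n = 7, the level at which the Results report equal numbers of true and false positives.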

New curve for length-dependent significance of pairwise 
sequence identity 

I attempted to solve the problems of the original HSSP-curve
(eqn 1; Results) by defining the following curve for the
separation of true and false positives (Figure 3, grey line with
dotted circles):

\[
p^I(n) = n + 480 \cdot L^{-0.32 \cdot (1 + e^{-L/1000})}
\qquad (2)
\]

where L gave the number of residues aligned between two
proteins; p^I the cut-off percentage of identical residues over
the L aligned residues; and n described the distance in
percentage points from the curve (n = 0 plotted in Figure 3).
The constraints in visually selecting the final function were (i)
to maintain the functional form defined by eqn 1 (and suggested
by the statistics of Alexandrov and Soloveyev, 1998); (ii) to
hit the 100% mark at alignments that are too short to reveal
anything about structural similarity (≈11 residues); (iii) to
saturate at levels around 20% sequence identity (reached for
length ≈ 300); and (iv) to roughly reflect the observed gradient.
Saturation for long alignments was realized by the functional
form of the exponent (note: the term e^{-L/1000} resulted in an
exponential decay). This 'saturation' constraint also affected
the particular value of the factor (0.32 rather than about 0.5
as suggested by the distribution of the data, Figure 4).
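
The constraints listed above can be checked numerically. The sketch below assumes eqn 2 reads p^I(n) = n + 480 * L^(-0.32 * (1 + exp(-L/1000))), which is the form implied by the factor 0.32, the exponential term and the stated saturation behaviour; the function name is hypothetical.

```python
import math


def new_identity_threshold(length: int, n: float = 0.0) -> float:
    """Cut-off percentage identity of the new curve (assumed form of eqn 2)."""
    if length <= 0:
        raise ValueError("alignment length must be positive")
    # The exponent shrinks in magnitude as L grows, flattening the curve.
    exponent = -0.32 * (1.0 + math.exp(-length / 1000.0))
    return n + 480.0 * length ** exponent
```

Under this assumed form, the threshold exceeds 100% at L = 11, drops just below 100% at L = 12, and is close to 20% at L = 300, consistent with constraints (ii) and (iii).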

New curve for length-dependent significance of pairwise 
sequence similarity 

Fig. 4. Pairwise sequence similarity versus alignment length. (A) Correctly
detected structural homologues; (B) false positives. Open circles, original
HSSP-curve (Sander and Schneider, 1991) (eqn 1); filled triangles, new
curve proposed here (eqn 3).

The original HSSP-curve was derived for sequence identity,
not for sequence similarity (Sander and Schneider, 1991). The
functional dependence between similarity and length appeared
comparable to the one between identity and length (Results).
This prompted a similar definition for the separation between
true and false positives based on similarity:

\[
p^S(n) = n + 420 \cdot L^{-0.335 \cdot (1 + e^{-L/2000})}
\qquad (3)
\]

where L gave the number of residues aligned between two
proteins; p^S defined the cut-off for the percentage of residue
similarity over the L aligned residues; and n described the
distance in percentage points from the curve (n = 0 plotted
in Figure 4).
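
The similarity threshold can be evaluated in the same way as the identity threshold. The sketch below assumes eqn 3 reads p^S(n) = n + 420 * L^(-0.335 * (1 + exp(-L/2000))); the function names are hypothetical, and similarity_distance simply returns how many percentage points an alignment lies above the n = 0 curve.

```python
import math


def new_similarity_threshold(length: int, n: float = 0.0) -> float:
    """Cut-off percentage similarity of the new curve (assumed form of eqn 3)."""
    if length <= 0:
        raise ValueError("alignment length must be positive")
    exponent = -0.335 * (1.0 + math.exp(-length / 2000.0))
    return n + 420.0 * length ** exponent


def similarity_distance(similarity: float, length: int) -> float:
    """Distance n (percentage points) of an alignment from the n = 0 curve."""
    return similarity - new_similarity_threshold(length, 0.0)
```

Under this assumed form the curve lies near 10% for alignments of about 500 residues, in line with the saturation level quoted in the Results.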

Sequence-space-hopping

Suppose proteins A_0 and B_0 were less than 25% identical;
family A is given by: {A_0, A_1,..., A_n} (such that all proteins in
the family A are more than 25% identical to A_0); analogously
family B is given by: {B_0, B_1,..., B_m}. Although A_0 and B_0
differed by more than 75%, it may well be true that both were
aligned to the same sequences, i.e. that for some i and j: A_i =
B_j. If this is the case, 'sequence-space-hopping' refers to
simply extending both families A and B to become: {A_0, A_1,...,
A_n, B_0, B_1,..., B_m} (Figure 1). Technically, I described this
situation by compiling a simple matrix H(A,B) that contained
the number of overlapping proteins (i.e. those contained both
in family A and B) between all proteins in the test set (792
chains) and all proteins in the search set (5646 chains). For
example, H(A,B) = 5 implied that test protein A and search
protein B had five identical proteins in their family alignments.
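
The overlap matrix H(A,B) can be sketched in a few lines of Python. The sketch assumes each family is given as a set of sequence identifiers keyed by its seed protein; the identifiers, function names and dictionary representation are hypothetical, and the additional requirement that the shared proteins be aligned in the same region (Figure 1) is not modelled.

```python
from typing import Dict, Set, Tuple

Families = Dict[str, Set[str]]


def hopping_matrix(test_families: Families,
                   search_families: Families) -> Dict[Tuple[str, str], int]:
    """H(A,B): number of proteins common to the families of A and B."""
    return {
        (a, b): len(members_a & members_b)
        for a, members_a in test_families.items()
        for b, members_b in search_families.items()
    }


def linked_pairs(h: Dict[Tuple[str, str], int],
                 min_common: int = 1) -> Set[Tuple[str, str]]:
    """Pairs (A,B) accepted as linked: at least min_common shared proteins."""
    return {pair for pair, count in h.items() if count >= min_common}
```

Raising min_common (e.g. to 5 or 10) corresponds to the stricter hopping strategies compared in Figure 6.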




The family alignments were taken from the HSSP database
(Schneider et al., 1997) with a cut-off at: HSSP-curve + 10%
(n = 10 in eqn 1), i.e. for alignments longer than 80 residues,
35% pairwise sequence identity was required. All protein pairs
(A,B) in the twilight zone were investigated for which H(A,B)
was larger than zero. Note, the concept of sequence-space-hopping
explored here is being used in everyday sequence
analysis. The novel idea introduced by others (Abagyan and
Batalov, 1997; Neuwald et al., 1997; Park et al., 1997) was
NOT to use sequence-space-hopping, but to use it for reducing
false positives in large-scale sequence analysis. Here, I simply
applied this concept to the large data set explored,
and investigated its usefulness in dependence on various
parameters.

More-similar-than-identical rule 

A simple rule-of-thumb was explored: accept hits only if the
level of sequence similarity was higher than the level of
sequence identity. This rule may appear to be non-selective in
that similarity would always be larger than identity; however,
for the given definition of similarity (using the McLachlan
metric), this was not the case.
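
As a filter over database hits, the rule amounts to a single comparison. The following minimal Python sketch assumes each hit carries its percentage identity and percentage similarity; the Hit record and the example values are hypothetical.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One database hit with its two alignment percentages (illustrative)."""
    name: str
    identity: float    # percentage of identical residues
    similarity: float  # percentage residue similarity


def more_similar_than_identical(hits: List[Hit]) -> List[Hit]:
    """Keep only hits whose similarity is strictly higher than their identity."""
    return [hit for hit in hits if hit.similarity > hit.identity]
```

Hits with equal percentages are discarded here because the rule is stated as 'higher than'.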

Results 

Number of false positives exploded in twilight zone 
In contrast to 1990, when Sander and Schneider (1991)
compiled their data, protein pairs of dissimilar structure
were now detected above the 30% cut-off (Figure 2A). And these
were not exceptions: at a level of 32% (HSSP-curve + 7%,
i.e. n = 7 in eqn 1), the number of false positives already
equalled that of homologues. For the original HSSP-curve the
number of false positives was 20-fold higher than the number
of true pairs. The transition from 20 to 30% sequence identity
was highly non-linear for true and false positives (logarithmic
scales in Figure 2): the number of true pairs rose by a factor
of 5, that of false pairs by a factor of 200 (Figure 2B). Thus,
below the region of significant pairwise sequence identity
(>34%) the population of false positives exploded. However,
the vast majority of homologues also had less than 30%
sequence identity.

Functional shape of original HSSP-curve adequate 
The functional shape of the original HSSP-curve proved to be
basically correct (Figure 3, grey line with triangles). However,
the larger data set analysed here revealed several problems in
detail (Figure 3B). (i) A threshold of 25% was not reasonable
for an alignment length below 150-200 residues. (ii) Above
an alignment length of about 100 residues, the derivative of
the curve separating true and false positives should be lower
than at lengths below 80. I attempted to solve these problems
by defining a new curve for separating true and false positives
(eqn 2; Figure 3, grey line with dotted circles). The particular
functional form guaranteed an approximate saturation for long
alignments. For alignments shorter than 11 residues eqn 2
yielded values above 100%. However, this was acceptable as
100% identity for fragments of 10-11 residues does not imply
structural similarity (Cerpa et al., 1996; Minor and Kim, 1996;
Muñoz and Serrano, 1996). The new curve saturated around
20% for alignments over more than 250 residues.

Defining a curve for pairwise sequence similarity 
Compiling sequence identity neglects the physico-chemical
nature of amino acids. Any multiple sequence alignment
illustrates that, for example, the feature hydrophobicity is more
conserved than is the residue type. For the million protein
pairs investigated here, this was reflected in a shift of the
scatter plot towards lower percentages (Figure 4). In particular,
for longer alignments false positives fell below 15% pairwise
sequence similarity. This prompted the introduction of a
threshold specifically for sequence similarity (eqn 3 in
Methods; Figure 4, grey line with dotted circles). The curve
surpassed 100% for alignments shorter than 12 residues and
saturated at about 10% for alignments over more than 500
residues.

Fig. 5. Accuracy and sensitivity for detecting homologies in the twilight zone. How to choose the cut-off line for automatic database searches? The graphs
A-D illustrate the pros and cons of particular choices. Given are the cumulative numbers of correctly detected homologues (true positives, A), and of false
positives (B), as well as the cumulative percentages of all correctly detected homologues (true positives, C), and of all homologues that were missed (false
negatives, D) in dependence of the cut-off distance from the thresholds defined in eqn 1-3 (parameter n). Thresholds: (1) HSSP-curve (eqn 1), (2) new curve
for sequence identity (eqn 2), (3) new curve for sequence similarity, (4) subset of proteins for which similarity is larger than identity (grey line in D: false
negatives for this subset), (5) simple cut-off according to sequence identity disregarding alignment length (as often used in practice). Note: counts of true
positives for the simple sequence identity cut-off (no alignment length) did not even fall into the interval displayed.

Better detection of homologues in twilight zone by new 
curves 

The new curves for length-dependent cut-offs in sequence
identity (eqn 2) and similarity (eqn 3) resulted in clearly lower
false positive rates (higher accuracy) than the original
HSSP-curve (Figure 5B and C). This was paid for by a lower number
of true positives detected (lower coverage; Figure 5A). At
n = 0 (eqn 1-3), the old curve yielded about twofold more
true positives, but more than 20-fold more false positives
compared to the new curves for identity and similarity.
Furthermore, at any level of true positives detected, the number of
false positives was smaller for the new curves (eqn 2-3) than
for the original HSSP-curve (eqn 1; Figure 7). When applying a
cut-off according to mere sequence identity (ignoring alignment
length), accuracy dropped below 10% at levels of 30% sequence
identity (Figure 5C). Thus, detection accuracy rose almost
10-fold by the new curves.

Improving detection accuracy by expert rule 
Experts often apply rales-of-thumb to visually distinguish true 
and false positives. However, many of such simple rules 
appeared not valid for automatic implementation. In particular, 
the distributions of the number and length of insertions did 
not, on average, differ between false and true positives (data 
not shown). Detection accuracy improved niarginally by apply- 
ing the following rules: (i) compile the distance for the 
similarity score (eqn 3), and the identity score /r 7 (eqn 2), 
average over both ([n* + n 1 ]/!), and accept pairs when this 
average is above some threshold n\ (ii) take pairs whenever 
either idratiry or similarity surpassed the respective threshold 
(either TJ n J > n)\ (m) take pairs if both values where above 
a given cut-off (/r 5 U n J > n). In contrast, detection accuracy 
increased significantly by applying the 'more-srmilar-than- 
identicar rule: accept hits found in a database search only if 
percentage similarity is larger than percentage identity. This 
constraint resulted in >98% detection accuracy at n — 0 cut- 
off levels (eqn 2-3), while 2-4-fold less true positives were 
Fig. 6. Improving accuracy by sequence-space-hopping. Distances were
compiled according to the old curve (eqn 1, 'old'), and to the new curve for
identity (eqn 2, 'ide'). Corresponding levels of sequence identity are shown on
top. The cumulative percentages of true positives detected at a given cut-off
distance were compiled for three different hopping strategies: hits were
accepted if at least one (H(A,B) = 1), five (H(A,B) = 5) or 10 (H(A,B) =
10) proteins were common between two protein families (Methods).

(A) Cumulative percentage of true positives (false positives = 100 - true);

(B) cumulative number of true positives. The comparison of the true
positives reached by intermediate sequences and all true positives (grey line
in B, note: same as in Figure 2) showed that (i) less than 1/1000 of the true
positives were reached by intermediate sequences; (ii) the number of pairs
reached by intermediate sequences did not explode in the twilight zone
(scale on the left covers two orders of magnitude, that on the right only
one). Numbers for true and false negatives would not make sense for this
analysis: as we don't know all proteins, we cannot conclude that two
families are unrelated only because we don't find a link between them.



found at this level (Figure 5A and C). Hence, applied as a 
conservative cut-off in automatic database searches, this rule 
proved rather powerful. 
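
The acceptance rules compared in this section can be sketched as simple predicates. This is a minimal illustration, not the implementation used for the analysis: the `Hit` record and all function names are hypothetical, and the distances from the identity and similarity curves (eqn 2 and eqn 3) are assumed to have been computed beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One aligned pair from a database search (hypothetical record)."""
    pct_identity: float    # percentage of identical residues in the alignment
    pct_similarity: float  # percentage of similar residues in the alignment
    n_identity: float      # distance from the identity curve (eqn 2)
    n_similarity: float    # distance from the similarity curve (eqn 3)


def accept_mean(hit: Hit, n: float) -> bool:
    """Rule (i): the average of both distances must be above n."""
    return (hit.n_similarity + hit.n_identity) / 2 > n


def accept_either(hit: Hit, n: float) -> bool:
    """Rule (ii): at least one of the two distances must be above n."""
    return hit.n_similarity > n or hit.n_identity > n


def accept_both(hit: Hit, n: float) -> bool:
    """Rule (iii): both distances must be above n."""
    return hit.n_similarity > n and hit.n_identity > n


def accept_more_similar_than_identical(hit: Hit) -> bool:
    """Accept only if percentage similarity exceeds percentage identity."""
    return hit.pct_similarity > hit.pct_identity
```

The last predicate needs no threshold of its own, which may be part of why it works as a conservative filter on top of any of the other cut-offs.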

Improving detection accuracy by sequence-space-hopping 

Hopping in sequence space proved successful in discarding
false positives. Already the minimal constraint to accept a pair
if at least one protein was common between the two sequence
families yielded levels of around 80% accuracy even down
to cut-off levels corresponding to 20% sequence identity
(Figure 6A, compared with <20% accuracy for the normal
thresholds, Figure 5C). Accuracy increased further when more
proteins were required to be common to both families
(Figure 6A). However, sequence space hopping was possible
for only relatively few protein pairs (Figure 6B). Furthermore,
the improvement in accuracy was less clear using
sequence-space-hopping than by applying the
'more-similar-than-identical' rule (Figure 5).
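
The hopping criterion amounts to counting the proteins shared by two sequence families. A minimal sketch follows, assuming each family is available as a set of protein identifiers; the function names and the identifiers in the usage note are hypothetical.

```python
from typing import AbstractSet


def shared_proteins(family_a: AbstractSet[str], family_b: AbstractSet[str]) -> int:
    """Number of proteins that occur in both sequence families."""
    return len(family_a & family_b)


def accept_by_hopping(
    family_a: AbstractSet[str],
    family_b: AbstractSet[str],
    min_shared: int = 1,
) -> bool:
    """Accept a pair if its two families share at least `min_shared` proteins.

    The text reports results for min_shared = 1, 5 and 10.
    """
    return shared_proteins(family_a, family_b) >= min_shared
```

For example, with `family_a = {"P1", "P2", "P3"}` and `family_b = {"P3", "P4"}` the pair is accepted under the minimal constraint (`min_shared=1`) but rejected when five common proteins are required.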



Accuracy versus coverage for BLAST and full dynamic 
programming 

The balance between accuracy (percentage of detected pairs that
are true) and coverage (percentage of all true pairs that are
detected) enables choosing automatic thresholds according to the
particular purpose of a database search. It also permits
comparing different methods (the higher the values, the better).
(i) As expected, the commonly used simple level of sequence
identity (disregarding alignment length) proved, again, an
extremely bad choice. (ii) Surprisingly, the fast database
searching method BLAST performed relatively well in comparison
to the full dynamic programming (Figure 7A). (iii) Both BLASTP
version 2 and PSI-BLAST were almost as good as the full dynamic
programming with the previously defined HSSP-threshold
(Sander and Schneider, 1991). (iv) Best performance was
achieved by the new threshold for similarity (eqn 3). (v) However,
the raw alignment score performed almost as well.
(vi) BLASTP (Altschul et al., 1990) performed rather similarly
to the more elaborate and more recent PSI-BLAST (Altschul
et al., 1997) (and for 'high' accuracy even slightly better,
Figure 7A inset; note: given that standard parameters were
chosen, this was not surprising). The corresponding thresholds
were given in Figure 5B for the dynamic programming, and
in Figure 7B for the PSI-BLAST probabilities.
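
Accuracy and coverage at a given cut-off can be computed from a list of scored pairs with known structural relationship. The sketch below is illustrative only: the function name, the input layout and the toy scores in the usage note are assumptions, and higher scores are assumed to indicate stronger hits.

```python
from typing import Iterable, Tuple


def accuracy_coverage(
    scored_pairs: Iterable[Tuple[float, bool]], cutoff: float
) -> Tuple[float, float]:
    """Return (accuracy, coverage) in percent for pairs scoring above `cutoff`.

    Each item is (score, is_true), where is_true marks a pair of similar
    structure. Accuracy = true accepted / all accepted; coverage = true
    accepted / all true pairs in the data set.
    """
    pairs = list(scored_pairs)
    total_true = sum(1 for _, is_true in pairs if is_true)
    accepted = [is_true for score, is_true in pairs if score > cutoff]
    true_accepted = sum(accepted)
    accuracy = 100.0 * true_accepted / len(accepted) if accepted else 0.0
    coverage = 100.0 * true_accepted / total_true if total_true else 0.0
    return accuracy, coverage
```

With the toy input `[(5.0, True), (3.0, True), (1.0, False), (-2.0, True), (-4.0, False)]`, a cut-off of 2.0 accepts only true pairs but misses one of the three, while a cut-off of -3.0 recovers all true pairs at the price of one false positive. Sweeping the cut-off in this way traces out curves of the kind compared in Figure 7A.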

Many false negatives at reasonable cut-off values 
The number of false negatives is often of interest, i.e. the
number of proteins that belong to a structure family but were
not detected above a given cut-off. For the data sets used here,
the cumulative percentage of false negatives was extremely
high for all reasonable cut-off levels (Figure 5D). The vast
majority of all pairs of proteins with similar structure populate
the midnight zone below 10% sequence identity (Rost, 1997).
Thus, the extremely high false negative rates proved that
methods aligning two proteins merely based on the pairwise
levels of sequence homology clearly fail to find the gold mine
of database searches (and that older analyses that failed to
describe this effect were based on biased data sets).

Thresholds for practical use 

For simplicity the functions (eqn 1-3) were explicitly provided
in tables (Rost, 1998). At levels of n = 0 (eqn 1-3) the
cumulative number of true positives were (Figure 5): HSSP-curve
(eqn 1), 12%; new identity curve (eqn 2), 56%; new
similarity curve (eqn 3), 73%. In order to achieve levels of
99% correct hits, m percentage points have to be added to the
curves, where m was: HSSP-curve, m = 8; new identity curve,
m = 5; new similarity curve, m = 12. For comparison,
applying the 'more-similar-than-identical' rule yielded levels
above 99% down to m = -1.
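
In practice, shifting a curve by m percentage points can be sketched as follows. The offsets are those quoted above; everything else is an assumption for illustration. In particular, the threshold curve is passed in as a function rather than hard-coded, the names are hypothetical, and a hit lying exactly on the shifted curve is treated as rejected.

```python
from typing import Callable

# Offsets m (percentage points) quoted in the text for about 99% correct hits.
OFFSET_FOR_99_PERCENT = {"hssp": 8.0, "identity": 5.0, "similarity": 12.0}


def distance_from_curve(
    pct: float, length: int, curve: Callable[[int], float]
) -> float:
    """Distance n of an observed percentage from a length-dependent curve."""
    return pct - curve(length)


def accept(
    pct: float, length: int, curve: Callable[[int], float], m: float = 0.0
) -> bool:
    """Accept a hit if it lies more than m percentage points above the curve."""
    return distance_from_curve(pct, length, curve) > m
```

As a usage example with a purely illustrative flat curve `lambda length: 25.0` (a placeholder, not eqn 1-3), a hit with 32% identity over 120 residues passes the stricter identity cut-off (`m=OFFSET_FOR_99_PERCENT["identity"]`), whereas a hit with 28% identity passes only at m = 0.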

Conclusions 

Rapid transition from trivial to needle-in-haystack problem 

The twilight zone of sequence pair alignments (20-35%
pairwise sequence identity) was characterized by two non-linear
transitions. (i) The number of homologues (true positives)
rose by a factor of about eight (Figure 2A). I obtained a
similar result from analysing the first four entire genomes
(Rost, 1997), which indicated that this result was general rather
than database dependent. (ii) The number of false positives
rose by a factor of 5000 (Figure 2B). Hence, separating true
and false positives switched from a trivial task (above 35%)
to the problem of finding needles in a haystack (20-30%).
Fig. 7. Accuracy versus coverage for various methods and thresholds.
Accuracy was defined as the cumulative percentage of true positives (actual
true/all actual), coverage as the percentage of true positives that were
detected at a given threshold (actual true/all true). (A) Thresholds and
methods shown: identity, new threshold for length-dependent sequence
identity (eqn 2); similarity, new threshold for length-dependent sequence
similarity (eqn 3); HSSP-curve, curve proposed by Sander and Schneider
(1991; eqn 1); %identity, threshold given by sequence identity alone, i.e.
disregarding alignment length; alignment score, score used for the dynamic
programming optimization MaxHom; blast2, BLASTP version 2 (Altschul
and Gish, 1996); psi-blast, BLASTP version 3 (Altschul et al., 1997), run
with standard parameters. The values for the BLAST methods were based
on the probability scores reported by these algorithms. The BLAST methods
did not report all pairwise alignments, thus the data set had to be reduced to
the subset for which aligned pairs were reported by all three methods
(MaxHom, BLASTP2, BLASTP3). Note that whereas the curves for the
BLAST methods, as well as for identity and similarity, are likely to hold up
in general, the curve for the alignment score is valid for the particular
implementation of the dynamic programming in MaxHom, and for the
particular choice of parameters (Methods). (B) Detail of the relation
between the BLAST probability (here for psi-blast), and the cumulative
number of true/false hits, as well as percentage accuracy and coverage.



The explosion of false positives shed light on the shape of
sequence space. From 100-35% sequence identity, any residue
exchange resulting in a stable structure maintains structure.
From 28-35% sequence identity, most residue exchanges
maintain structure. From 20-28% sequence identity, the
absolute majority of residue exchanges forming stable structures
populate different protein families. Is the explosion
caused by features of structure space? If one generates protein
sequences at random (or randomly superposes non-related
proteins), the counts for most of the region above 10%
sequence identity are negligible (Rost, 1997). Thus, although
it is obvious that we expect to find more pairs for lower levels
of sequence identity based on mere statistics, the particular
transition in the twilight zone seems not to be evident. However,
this analysis did not provide answers to whether or not the
observed explosion may reflect structural (Chung and Subbiah,
1996) and/or functional constraints.

Poor distinction between true and false positives by 
sequence identity, alone 

Even journals such as Cell, or EMBO provide an ample source
for the following fallacy: 'these two fragments of 16 residues
adopt similar structures as they have more than 10 similar
residues'. Thus, one of the most important messages of this
analysis might be the repetition of a point made by others
(Sander and Schneider, 1991): high levels of sequence similarity
or identity do not guarantee structural similarity (Figure 5).
Instead, the levels of significant sequence identity and similarity 
depend on the alignment length (Figures 3 and 4), or the 
respective raw score of the alignment methods. 

Better distinction by new curves for sequence identity and 
similarity 

The length-dependent cut-off for significant sequence identity 
pioneered by Sander and Schneider (1991) needed refinement
in several ways to account for the findings from a 1000-fold
larger data set: (i) shift towards higher values for shorter
alignments; (ii) saturation for alignments longer than 150 
residues; (iii) definition of new curve for levels of sequence 
similarity. These tasks were solved by introducing threshold 
curves for significant sequence identity (eqn 2), and for 
significant sequence similarity (eqn 3). The precise definition 
of the two thresholds was entirely empirical. However, the 
essential functional dependency of the curves was kept similar
to what would be expected from pure statistical considerations.
Although not true for all problems (Nielsen et al., 1996), on
average, sequence similarity was marginally more successful
than identity in distinguishing true and false positives. The
new curves improved accuracy at a given coverage (Figures 5
and 7). Additionally, this analysis supplied detailed levels for 
expected accuracy and coverage for the curves defined, as 
well as for standard BLAST searches (Figures 5 and 7). 
Such estimates may have implications for automatic database 
searches. They also shed light on the comparison between 
sequence alignments and threading techniques that both only 
make use of pair comparisons (rather than using family specific 
profiles): already at levels of 25% sequence identity, pair 
alignments detect only 10-30% true positives. This is below 
the level of what threading techniques achieve in the interval 
0-25% sequence identity (Sippl, 1995; Fischer and Eisenberg, 
1996; Russell et al., 1996; Rost et al., 1997).

Improved accuracy by 'more-similar-than-identical' rule and 
sequence space hopping 

The number of false positives was significantly reduced by 
two techniques (only the first of which was novel to this
work). (i) The 'more-similar-than-identical' rule: 95% of all
pairs for which percentage similarity was larger than percentage 
identity had similar structures. Thus, this constraint clearly 
improved detection accuracy. The cost was low coverage: for 
only 10% of the structurally similar pairs the percentage 
similarity was larger than percentage identity. This might be 
explained by the fact that half of the protein, on average, 
embedded in loop regions, may tolerate residue exchanges that 
do not conserve physico-chemical properties (and thus decrease 
the overall average more than the few to-be-conserved-regions
increase it). (ii) The usage of 'multi-links' (Abagyan and
Batalov, 1997), 'intermediate sequences' (Park et al., 1997),
'transitivity' (Neuwald et al., 1997), or 'sequence space
hopping': most protein pairs that contained a similar subset of 
identical proteins in their respective sequence families were 
found to have similar structures even at low levels of sequence 
homology. Obviously, the validity of transitivity (detection 
accuracy) between protein families (Figure 1) depended on 
the distance between the families (Figure 6). Interestingly, the 
improvement of accuracy hardly depended on the number of 
proteins required to be common to two families. This suggested 
that although the vast majority of protein pairs with 25% 
sequence identity had dissimilar structures, the 'islands'
populated by structure families were well separated. Unfortunately,
for the data set explored here, the yield of this analysis was 
found to be very low: on average only one in 1000 pairs was 
reached via intermediate sequences (Figure 6). Furthermore, 
sequence-space-hopping resulted in clearly lower
coverage/accuracy ratios than did the application of the
'more-similar-than-identical' rule (Figures 5 and 6).

Beginning of the 90s: over-estimation of sequence alignment
methods 

Until 1996, very few people had taken up the laborious
task of objective large-scale analyses of protein sequence
comparisons, partly because automatic structure comparison
methods are fairly recent. The few earlier workers (Sander
and Schneider, 1991; Vogt et al., 1995; Gotoh, 1996) based
their work on data sets of about 1000 pairs of protein structure
alignments. Gotoh (1996) and Vogt et al. (1995) used the same
set (Pascarella and Argos, 1992) for testing different alignment
methods, and a variety of substitution metrics. They focused
on monitoring the detailed accuracy in terms of number of
residues correctly aligned. Due to the small data set, Vogt et al.
(1995) found about 98% true positives at 30% sequence
identity (ignoring alignment length), and 50% true positives
at 20% sequence identity. For the 1000-fold larger data
set used here the corresponding values were quite different
(ignoring alignment length): 11% true positives at 30%
sequence identity, and 5% true positives at 20% identity.
However, even the more conservative analysis introducing
the importance of alignment length for levels of significant
sequence identity (Sander and Schneider, 1991) still
over-estimated the possible levels of sequence identity between
proteins of dissimilar structure.

End of the 90s: database searches do not reach the 
gold mine, yet 

The thresholds for sequence identity and similarity defined
here, as well as those established by others (Abagyan and
Batalov, 1997; Brenner et al., 1998) complemented the levels
for 'significance' provided by BLAST (Altschul and Gish,
1996), FASTA (Pearson, 1996) or other statistical analyses
(Bryant and Altschul, 1995) by addressing the question 'how
significant is the significance of the respective alignment
method?'. Based on quite different data sets the principal
messages were similar: (i) most proteins of similar structure
were not found by pairwise sequence comparisons at reasonable
cut-off thresholds; (ii) raw scores from dynamic programming
methods were comparable to the original length-dependent
cut-off thresholds for sequence identity (Sander and Schneider,
1991); (iii) dynamic programming was only slightly superior
to BLAST searches (Altschul and Gish, 1996; Altschul et al.,
1997). However, in detail the numbers differed between the
recent analyses. Obviously, the absolute values depended
crucially on the particular choice of the data set. Abagyan and
Batalov (1997) analysed various substitution metrics on a
data set comparable to the one used in this analysis. They
concluded that raw alignment scores provide better separations
between true and false positives than do length-dependent
cut-offs for sequence identity and similarity. The difference
between their result and the one shown here may result from
the fact that Abagyan and Batalov (1997) used the optimal
choice of all parameters for comparing the raw alignment
score to sequence identity and similarity. Brenner and
co-workers have analysed the accuracy and coverage for various
statistical scores (Brenner et al., 1998). They used a completely
different data set than I did. An approximate comparison of
the two analyses was possible by the reference point of
simple identity (ignoring alignment length). It seems that the
performance for the best separation method they found (new
FASTA) was comparable to the improved, simple thresholds
defined here (eqn 2-3). Here, the BLAST probability was
found to be a relatively good way to separate true and false
positives (Figure 7A): it was only slightly inferior to the raw
dynamic programming alignment score, results for which hold
up exclusively for the particular choice of parameters and the
particular alignment algorithm used.

Thresholds in practice 

The advantage of the length-dependent levels of identity and
similarity (eqn 2-3) over other thresholds (Abagyan and
Batalov, 1997; Alexandrov and Solovyev, 1998) was that
these thresholds, in principle, are applicable to any alignment,
and may relate more explicitly to structure. Identity and
similarity can be compiled easily without having to re-do the
entire database search. In practice, this does not always hold
up: (i) different parameters (e.g. the way in which gaps are
treated) may result in different alignments; and (ii) the similarity
values compiled hold for the choice of a particular metric
(here McLachlan). Additionally, the thresholds introduced here
provide independent evidence for the separation, and permit
the application of the successful 'more-similar-than-identical'
rule.

Will the analysis hold up for the next 500 structures? 
The results given here were based on the largest possible data
set for which structural alignments provided a well-defined
distinction between true and false. One conclusion was that
seven years ago (Sander and Schneider, 1991) the database
was too small to capture the details. Will this also be true in
2005? Answers have to remain speculative. (i) Although the
database used in 1990 was 1000-fold smaller than the one
used here, some principal findings were verified. (ii) Assuming
that there are only 1000 folds in nature (Chothia, 1992), and
that these correspond to about 10 000 families, then even the
full catalogue of all protein sequences would yield a data set
essentially only 30 times larger than the one used here (note:
the data set used corresponded to about 300 different folds
aligned against about 1000 families).

Rather more accurate, or more sensitive? 
An accurate and sensitive distinction between true and false
positives is important for automatic database searches. The
new curves introduced here (eqn 2-3) proved slightly more
sensitive (higher coverage) and more accurate than the previously
proposed curve (Sander and Schneider, 1991). The
accuracy increased significantly by applying the
'more-similar-than-identical' rule, and by sequence space hopping. However,
accuracy was gained at the expense of coverage. Which is more
important? Clearly, the evolutionary information contained in
multiple alignments is the single most important contribution
to improving protein structure prediction in the 90s (Rost and
Sander, 1996; Rost and O'Donoghue, 1997). Is the gain by
increased diversity more important than the loss of accuracy
when using alignments for structure prediction? The answer
depends on the particular prediction goal. For example, for
secondary structure prediction diversity is more important than
accuracy (cut-off at 25% versus that at 30%), whereas for
the prediction of solvent accessibility the opposite is true
(unpublished). Furthermore, as databases grow, coverage may
be less important than accuracy. Irrespective of individual
preferences, the sharper the knife cutting between true and
false positives, the better. This analysis has sharpened the
knife a little, and added new optional tools to it.